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Abstract — Despite great interest in solving RNA secondary 
structures due to their impact on function, it remains an 
open problem to determine structure from sequence. Among 
experimental approaches, a promising candidate is the "chemical 
modilication strategy", which involves application of chemicals 
to RNA that are sensitive to structure and that result in 
modifications that can be assayed via sequencing technologies. 
One approach that can reveal paired nucleotides via chemical 
modification followed by sequencing is SHAPE, and it has been 
used in conjunction with capillary electrophoresis (SHAPE-CE) 
and high-throughput sequencing (SHAPE-Seq). The solution of 
mathematical inverse problems is needed to relate the sequence 
data to the modified sites, and a number of approaches have 
been previously suggested for SHAPE-CE, and separately for 
SHAPE-Seq analysis. 

Here we introduce a new model for inference of chemical 
modification experiments, whose formulation results in closed- 
form maximum likelihood estimates that can be easily applied 
to data. The model can be specialized to both SHAPE-CE and 
SHAPE-Seq, and therefore allows for a direct comparison of 
the two technologies. We then show that the extra information 
obtained with SHAPE-Seq but not with SHAPE-CE is valuable 
with respect to ML estimation. 

I. Introduction 

RNA dynamics are increasingly recognized as central com- 
ponents of cellular function, controlling key processes such 
as gene regulation, antiviral defense, and environmental sens- 
ing fl], ||2l, lis). Strong links between RNA structure and 
function underlie the importance of structural analysis, which 
greatly benefits from the wealth of information provided by 
existing and emerging chemical mapping techniques [4|. In 
chemical mapping experiments, a chemical reagent modifies 
RNA molecules in a structure-dependent fashion. Depending 
on the reagent used, four distinct types of information can 
be gleaned, including spatial nucleotide contact information, 
solvent accessibility of the backbone, the local electrostatic 
environment adjacent to each nucleotide, and local nucleotide 
flexibility ||4|, ||5l- This information is then used to infer RNA 
structural dynamics, either independently or in conjunction 
with structure prediction algorithms |(6J, [TJ. The modification 
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Fig. 1. Overview of a SHAPE-Seq chemical mapping experiment, model, 
and statistical analysis. 



location is detected by means of conversion to cDNA using 
reverse transcriptase (RT), whereby transcription is blocked 
at the sites of modification (see illustration in Fig. [T]l. This 
generates a pool of cDNA fragments that begin at the 3' 
end of the molecule and terminate at the modified sites, or 
possibly at sites where there was natural RT dropoff [SJ. 
Traditionally, the cDNA fragments have been resolved and 
quantified with capillary electrophoresis (CE) {W\, although 
recently next-generation sequencing (NGS) technologies with 
much higher throughput have been used instead 

There are several challenges in interpreting chemical map- 
ping data obtained from reverse transcription, irrespectively of 
the fragment quantification method that follows it. Primarily, in 
molecules with multiple modifications, only the first one (i.e., 
the closest to the 3' end) is revealed (see Fig. [T]i, and thus less 
information is available about the 5' region of the molecule. 
Second, RT's natural propensity to terminate at any site needs 
to be decoupled from modification-based termination, and 
this effect is controlled for in a separate control experiment. 
Finally, experimental variations need to be controlled for when 
combining measurements. 

In previous work, we introduced a stochastic model for a 
specific next-gen sequencing based chemical modification ex- 
periment called SHAPE-Seq (selective 2'-hydroxyl acylation 
analyzed by primer extension followed by sequencing) |10|. 
Here we extend that work by presenting a more general model 
that entails fewer assumptions. In addition to capturing our 
previous SHAPE-Seq model as a special case, it is suitable 
for other experimental protocols, such as SHAPE-CE (SHAPE 
followed by capillary electrophoresis). The new model has the 



added advantage that the generalization reveals a simplified 
maximum likelihood estimation scheme that leads to an ele- 
gant and fast approach for recovering chemical modification 
signal. Finally, we show how the general framework can be 
used to directly compare the power of SHAPE-CE to SHAPE- 
Seq. A key result is that SHAPE-Seq improves on SHAPE-CE 
not only by allowing for multiplexing but also by measuring 
extra information that can be utilized in the statistical inference 
framework we propose. 

II. Modeling sequencing-based chemical mapping 

We consider an RNA molecule that contains n sites, num- 
bered 1 to ri according to their sequence-position with respect 
to the molecule's 3' end (see Fig. [T]), where the 3' end is 
excluded from analysis and assumes position 0. A cDNA 
fragment of length k that maps to the sequence between sites 
and A:— l(l<fc<n)is called a k-fragment, and a full 
transcript of length n + 1 is called a complete fragment. In a 
SHAPE experiment, often called the (+) channel, the RNA is 
treated with an electrophile that reacts with conformationally 
flexible nucleotides to form 2'-0-adducts. We define the 
relative reactivity of a site, Pk, to be the probability of adduct 
formation at that site. In the control experiment, called the (— ) 
channel, the primary source of incomplete fragments is RT's 
natural dropoff while transcribing the molecule. Because its 
propensity to drop may vary along sites, we define the dropofr^ 
propensity at site k, 7fc, to be the conditional probability that 
transcription terminates at site k, given that RT has reached 
this site. Therefore, associated with the RNA molecule are 
2n probabilities: B = . . . , /3„), < /3fc < 1 V fc, and 
r = (71, . . . , 7„), < 7A; < 1 V fc, which we wish to estimate 
from sequencing data. 

While we can readily infer the natural dropoff propensities 
from the (— ) channel data alone |10|, the fragments observed 
in the (+) channel reflect the combined effects of natural 
dropoff and chemical modification. A point that is key to 
interpreting chemical mapping data is that a /c-fragment is 
assumed to be generated when site k is the site that is 
first encountered by RT, regardless of the number of adducts 
that formed upstream of k. Assuming that adduct formations 
at the various nucleotides are statistically independent, the 
probability that a molecule is modified at site k (and possibly 
also at subsequent sites) is 

Prob (first adduct at site k) = ~ ft) (1) 

1=1 

for all 1 < k < n. Incorporating the natural degradation in the 
elongating pool of modified molecules, we have 

Cfc-fragment \ '^-i fe-i 
from j =11(1- Pk 11(1- ft). (2) 
modification/ i=i »=i 

We assume all other fragments originate from natural dropoff, 
either from unmodified or modified molecules, thus accounting 



for the following probability: 

Prob (fc-fragment from natural dropoff) (3) 

( no adduct \ /no adduct \ 

. , at any site x Prob at any site 
^''''''^ ilk ) \ ilk J 

fc-l k 
= lkll{l-l^)l[{l-P^). 



i=l 



Taken together, Eqs. |2] and |3] imply 

Prob (fc-fragment in (+) channel) (4) 

fe-i 

= ^-{l-Jk){l~l3k)]l[{l-l^){l-l3^) 



for all 1 < fc < n. Finally, because complete fragments can 
only arise from natural dropoff, we have 

Prob (complete fragment in (+) channel) (5) 

n n n 

= l[^l-l^)l[i^-^3^)=l[(^-7^)il-|3^). 
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i=l 



i=l 



Assuming we observe {Xi, . . . , Xn+i) /c-fragment and 
complete-fragment counts in the (+) channel, and similarly, 
(yi, . . . ,y„+i) fragment counts in the (— ) channel, the like- 
lihood of observing the entire sequencing data is given by 



fc-i 



CiB.T) 



n [(i-(i-7fc)(i-ft)) 

k=l 
k-1 

n(i-7.)(i-ft)i 



(6) 
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Xh 



[11(1-7.)] "^'[n(i-7.)(i-ft) 

i=l i=l 

III. Maximum-likelihood estimation 



In this section, we use the likelihood formulation in Eq. |6] 
to show that the ML estimates are given by 



Xk 



Yk 



PI = max { 0, 
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Yk_ 



l<k< 



(7) 
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Moreover, as will become clear from the derivation below, 
the likelihood formulation in Eq. |6] and its optimization 
can be readily extended to accommodate data from multiple 
replicates. One can then estimate the reactivities from mul- 
tiple sources of data simultaneously and in a straightforward 
manner, and without any further assumptions or estimation of 
the statistical inter-experiment variation. 



We start by rearranging terms in the log-likelihood function 
and writing it as the following sum of n terms: 



n+l 



log£(B,r) = ^[ ^ (X,+K,)log(l-7fc) 

fc=i " ' 



(8) 



=fc+i 



n+l 

E 

i=k+l 



X,log(l-/3fe)+n-log7fc 



+Xfclog(l-(l-7fc)(l-/3fc)) 

Eq. [S] suggests that log£(i3,r) is separable in the pairwise 
variables {/3k, 7k), hence each of the n two-dimensional 
functions can be optimized separately. To simplify notation, 
we introduce the constants Sk = ^^^k+ii^i + ^i), Rk = 
Yll=k+i -^i^ functions 

lk{Pk,lk) = Sk\og{l~^jk) + Rk\og{l- 13k) (9) 
+Yk log7fc + Xk log (1 - (1 - 7fc)(l - 13k)) 

foT 1 < k < n, such that log£(B,r) = J2k=i ^k{[3k,lk)- 

We now optimize lk{f3k,lk) under the assumption that all 
fragment counts are positive. In that case, 7^ is bound to lie 
in (0, 1) and Pk cannot exceed 1, but the constraint /3fc > is 
not inherent in the function and needs to be imposed during 
optimization. If we relax it, we find the optimal solution by 
setting the partial derivatives to zero, as follows 



Rk 



Xk{l-lk) 



l-Pk 

1 - Ik Ik 

yielding the solution 



l-(l-7fc)(l- 
Xk{l-Pk) 
l-{l--fk){l- 



Pk) 
Jk) 



(10) 



= 0, 



f3k = 



1 



Ik 



(11) 



One can verify local maximality of {f3k,%) from Ik's Hessian. 
Its global optimality then follows from Z/c(/3fe, 7fc)'s continuity 
and from the fact that it approaches -^00 near the boundary 
of its domain's closure. Eq.llll clearly results in (3k < 1, but 
there is no guarantee that I3k > 0, as its numerator consists 
of two terms, each comprising of data from either the (+) or 
the (— ) channels. As such, they are not constrained to yield 
a positive difference, and might result in infeasible estimates. 
We then wish to find a feasible ML solution, and we argue 
that whenever /3k < 0, this solution is attained at 



Y.tl{x, + Y,] 



(12) 



(see Appendix for justification). This means that whenever 



we observe a site k for which 



X-k 



best explanation of the observed data is that no modification 
occurred at that site and that all fc-fragments arose from natural 
dropoff. Remarkably, this result supports existing approaches 
to analyzing SHAPE-CE data, whereby sites whose recovered 
signal is negative are assigned zero reactivity I?]. 



the 



We now allow zero counts when k < n, while assuming that 
Xn+i, Yn+i > 0. The latter assumption is justified by the fact 
that Yn+i = is indicative of severe dropoff that could stem 
from strong transcription termination at select sites or reflect 
a cumulative effect of imperfect transcription elongation over 
a long RNA strand. Both situations are avoided by truncating 
the analyzed sequence at n' < n, such that Yn'+i > 0. On 
the other hand, Xn+i = (while Yn+i > 0) suggests a "too 
high" average modification rate, leading to strong signal decay 
in the (+) channel. One should then decrease the reagent's 
concentration. Nevertheless, zero counts at intermediate sites 
are commonly observed in practice. When Xk — but Yk ^ 0, 
it is straightforward to show that the optimum is determined 
by Eq. 12 whereas the case where = and Xk 7^ is 



optimized by the initial solution in Eq. [TT] 



A. Poisson-distributed chemical modification 

In this subsection, we revisit a chemical mapping model 
that we have previously developed and used for structure 
characterization 1,10] . The model incorporates an assumption 
on the stochastic nature of the underlying chemistry, and we 
discuss its implications on ML estimation from SHAPE-Seq 
data. By casting the model as a special case of the framework 
we presented above we are able to simplify our previous 
estimation scheme to obtain closed-form ML estimates of 
relative reactivies for the Poisson model in lilOJ: 



max |o, log (1 - 



log(l- 



The work in |10| makes an assumption that is widely 
used in models of biochemical reactions, whereby the reac- 
tion with the modifying reagent follows a Poisson process. 
Specifically, during modification, an RNA may be exposed 
to varying numbers of electrophile molecules, and we model 
the number of times it is exposed to these molecules as 
a Poisson process of an unknown rate c > 0, i.e., we 
assume that Proh{i exposures) = — . It is worth noting 
that the Poisson framework is especially suitable for low- 
incidence settings ifTTl . and that mapping experiments are 
particularly calibrated to yield single-hit kinetics, that is, 
they aim to achieve an average modification rate of c « 1. 
Now, each exposure may result in the modification of a 
site, where the site is determined according to a probability 
distribution Q = {9i,...,6n), Y^^=i^k — 1, where dk 
represents the relative reactivity of site k. It is easy to show 
that the number of modifications at site k also obeys a 
Poisson distribution, with an unknown rate r^, — cdk, that is. 



Prob{i modifications at site k) 
that Pro6(site k is not modified) = 



.-cBk 



— . It then follows 
Setting 



5fc := 



1 — Prob (site k is not modified) = 1 — e 



(14) 



we can write 



Prob (first adduct at site k) 



(15) 



fe-i 



= (1 

n 

Prob{no modification) = — Pi) = ■ 



(16) 



When plugging these expressions into Eq. |6] the likelihood 
function reduces to that in ifTOl . We can therefore use the 
initial estimates in Eq. [TT] along with Eq. [14] to estimate the 
distribution Q as follows: 



log(l 



log(l 



(17) 



En+l -1^ / 6 V y 

i=k 2^i=k 

where the scaling constant c is the estimate of the average 



modification rate, which is recovered from Eq. 16 to equal 



log(l - A) = log {^^) ~ log (^frf^) ■ 



(18) 

17 and 18 are the outputs of 



It is now apparent that Eqs. 
Algorithm 1 in |10|. 

When optimization yields negative /3fc's, they correspond 
to negative 6^^ (see Eq. 14 1. However, when imposing non- 
negativity, these are projected onto = 0, and 6*^ oc 
log(l — (31) = accordingly. The revised modification rate 
estimate now amounts to c* — —J2i-0 >o^'^S(^ ~ f^i)' which 
is larger than the initial c. Consequently, the distribution 9 is 
updated such that all negative entries are set to zero, while the 
others are effectively scaled down due to the increase from c 
to c* . In other words, one merely needs to compute the relative 
reactivity estimate in Eq. 13 and then normalize it by c* to 
generate a proper probability distribution 6*. Alternatively, 
one could apply any other normalization method to the outputs 



of Eq. 13 such as the ones currently used for interpreting 



SHAPE reactivities llT2l . Q. This would retain the relativity 
between reactivities, while adjusting the dynamic range to a 
scale that is in line with current settings of subsequent structure 
prediction modules Q, llT3l . 

In practice, the formulation of site-specific modifications 
via n independent Poisson processes, as opposed to the 
multinomially-distributed choice of a site via 8, vastly sim- 
plifies the likelihood function's derivation and optimization. 
In particular, it removes the need for the iterative likelihood 
optimization routine that follows Algorithm 1 in flOl . The 
equivalence between the two formulations is an instance of 
general equivalence between multinomial and Poisson log- 
linear models with respect to ML estimation lITTl . lfT4l . 

It is interesting to compare the estimates obtained un- 
der a Poisson assumption with those obtained without it. 
To qualitatively compare them, consider the relations 0^ cx 



means that a Poisson assumption is not expected to affect sites 
with small reactivity, but on the other hand, it amplifies the 
estimated relative reactivities at more reactive sites, and more 
intensely as the reactivity increases (i.e., as 1—/?^ — ^ 0). It thus 
exerts its effect by stretching the dynamic range of reactivities, 
and thereby might confer more sensitivity to outliers. Because 
the effect's intensity depends on /3^'s magnitude, it is likely to 
be more pronounced under high modification rate conditions, 
where either the reagent concentration is high or the RNA 
entails many highly reactive sites. 

A quantitative comparison between the two models using 
experimental data for the Staphylococcus aureus plasmid 
pT181 sense RNA is shown in Fig. |2] To allow for a fair 
comparison between B* and 0*, we scaled B* such that 
its entries sum to 1. Note that the scaling factor is larger 
in the Poisson case, and thus the Poisson-based reactivities 
are smaller than their general-model counterparts at relatively 
unreactive sites. The data in Fig.|2]reveal very mild differences 
between the estimates and consequently, minor increase in the 
dynamic range. Similar results were observed for a number of 
other molecules that we have probed |9|. It is worth noting 
that the modification rate in this experiment was estimated at 
c* — 1.94 adducts per molecule, which is relatively high and 
clearly diverts from single-hit kinetics. This suggests that the 
Poisson assumption may not be critical even in the presence of 
high modification rate, and that the two model-based schemes 
may generally be used interchangeably. 

Finally, we stress that the Poisson-based correction aligns 
more closely with current CE-based analysis methodology |7|, 
ifTSl . in the sense that the signals are in fact corrected 
separately for each channel and then subtracted, along the 



lines of Eq. 13 In contrast, ML estimation under the general 



framework is not amenable to such decoupling (see Eq. 11 



\og{l-j3l), and note that -^05(1^ 



X when a; w 0. This 



Alongside this seemingly Poisson-based correction, current 
analysis guidelines also recommend using one of two outlier 
filters 1 7 1, and these may remedy the increased sensitivity to 
outliers that we highlighted earlier. 

IV. Adaptation to quantification by capillary 

ELECTROPHORESIS 

Traditional chemical mapping techniques have used elec- 
trophoresis to identify the cDNA fragments and to quantify 
their abundances |4]. Most recently, capillary electrophoresis 
(CE) has been used to detect fluorescently labeled cDNAs 
from experiments. CE systems output an electropherogram, 
consisting of analog traces that report fluorescence intensity as 
a function of time. These traces must be extensively processed 
to extract quantitative nucleotide information, and while steady 
progress is being made with the development of computer- 
aided analysis tools fl2l, fTSl, fVT\, there is still a need for 
more statistically-robust and automated analysis methods. 

Despite the challenges in analyzing CE-based data and the 
advances offered by NGS platforms, the conventional approach 
is still valuable for two reasons. First, it is currently cheaper 
and faster to apply when the multiplexing and sensitivity 
advantages of the newer platforms are not needed, and second. 
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Fig. 2. Relative reactivity estimates for S. aureus plasmid pT181 sense RNA under the general-model and the Poisson-model maximum-likelihood frameworks. 
Sites 1-45 showed negligible probabilities and were omitted from display. 



it can be used for probing a few pilot RNAs prior to conducing 
a larger-scale NGS -based experiment. These considerations 
have motivated us to adapt our SHAPE-Seq ML framework to 
the case of SHAPE-CE. While the differences in analog versus 
digital signal processing are apparent, an essential difference 
between SHAPE-Seq and SHAPE-CE data has to do with the 
lack of information about complete fragments in CE settings. 
This is due to large amounts of long fragments under single- 
hit-kinetics conditions, causing detector saturation. Notably, 
a strong full-length signal also poses difficulty in accurately 
quantifying the last stretch of 10-20 nucleotides lilSJ . but in 
this work, we only address the first issue and assume that all 
other peaks are quantifiable. 

A. Maximum likelihood framework 

Here, we show that in the absence of complete-fragment 
information, the relative reactivities at sites 1 to n — 1 are 
estimated using the following formula (1 < A; < n — 1); 



o{CE)* 

Pj, = max ' 



0, 



1 



Y^ 



where (Xi, . . . , X„) and (Yi, . . . , y„) are the areas under the 
detected peaks in the (+) and (— ) channel traces, respectively. 
We also show that p'^^^* cannot be determined from the 
available information. 

Our result is possible due to the very recent automation of 
the trace-alignment and peak-fitting steps with the HiTRACE 
software ifTTl . This, in turn, generates quantifiable nucleotide 
reactivity data, in the form of integrated peak areas, prior to 
signal correction and scaling. The peak areas can then be used 
in place of the digital sequence-counts to deconvolve the ef- 
fects of over-modification and natural dropoff on the observed 
(+) channel signal, as was recently done in |15 | by means 
of an optimization routine. This is in contrast to previous 
analysis tools, where signal correction and scaling were needed 
prior to the application of semi-manual alignment, fitting, and 
integration routines |[T2l . 

We assume that the peak areas from the (+) and (— ) 
channels are proportional to the fragment counts as follows: 
Xk = 5Xk and Yk — eYfc for all 1 < /c < n where 5 and e 
are unknown positive constants. The two potentially different 



constants reflect experimental variation between the channels, 
including differences in such factors as molecular concentra- 
tions and dye intensities lfT2l . Currently, these are corrected 
for by scaling the (— ) channel signal by a constant factor 
that is set either manually following visual inspection 112] or 
automatically via an optimization routine ifTSll . We will see 
that in our ML scheme there is no benefit in applying such a 
"correction". 

The likelihood of observing the peak areas is given by 

A,-l i^l 



n [(i-(i-7fe)(i-/3fc) 



fe=l 
fc-1 



n(i-7.)(i-A) 



i=i 

and the log-likelihood function is then written as 

ji-i 



(19) iog/:'^^(s,r) = 



-y'„iog7„ 



k=l 

+ ^l„log(l-(l-7rO(l-AO), 



(21) 



where 



f7fclog(l - 7fe) 
+Ffclog(l-/3fc) 



(22) 



-Yk log 7fe 
e 



~^Xk log 



(l-(l-7fc)(l-A)), 



and Uk = Etk+lCs^^ + \%), V, = i Er=.+i X.- 

Assuming all peak areas are nonzero, we can repeat the 
derivation in the previous section, while noting two differ- 
ences: first, different coefficients appear in the equations and 
second, the last equation (when /c = n) is different. Whereas 
the first difference is minor when all observables are positive, 
the second one leads to an important difference in the ML 
solution. Therefore, we start by optimizing 1^^ {Pk,"fk) when 
l<fc<n— Ito obtain 



x^ 2:?=. Y^ 



Yk 



7f^ 



Yk 



V" Y 



(23) 



In the special cases where X^. or are zero, but Uk and Vk are 
positive and Uu > Vk, we obtain (/3f^,7f^) = (O, fi - ) 

and (/3^^,7^^) = ,0), respectively. It can also be 

verified that these points correspond to a global maximum, 
independently of the S and e values. When (3^^ < 0, one 
can repeat previous arguments to show that the constrained 



maximum is attained at {^1^^^* ,7k^^^*) = (O, 
Taken together, these results lead to the formulation in Eq. [T9 



Uk-1 



Importantly, one cannot evaluate ^f^^^* without knowledge of 
the relative scaling factor -, however, our goal is to estimate 
the relative reactivities, and these are independent of ^. 

While the estimates in Eq. [19] bear great similarity to 
the NGS-based estimates, the case where k = n reveals 
a different scenario, as the missing Xn+i and Yn+i ham- 
per /3„'s estimation. In this case, we optimize the function 

^^^(/?«, 7n) = l0g7n + log (1 - (1 - 7n)(l - /?„)), 

which is maximized at 7„ = 1, where its value is indepen- 
dent of /3„'s value. Consequently, one cannot determine /3„. 
Moreover, one cannot recover the exact fraction of modified 
molecules, as it depends on all n reactivities as follows: 

n 

fmod = Profc (molecule is modified) = 1 — — (24) 



A possible way to circumvent this limitation is by excluding 
site n from analysis and evaluating only /Sj;'^^^*, . . . , 13,^^^* 
based on all available peak areas, including X„ and Y^- In 
practice, the effect of such omission is likely to be minor, since 
the studied RNA sequence is typically embedded in between 
auxiliary RNA constructs, called structure cassettes |8|, and 
site n is therefore included in one of these cassettes |9|. We 
can then approximate the modification fraction by /,^^ w 1 — 
nr=i^ (l — Z?!*^^'*), where the goodness of this approximation 
depends on how large Xn is in comparison to the missing 
Xn+i- This is because the approximation implicitly assumes 



that 



signal we would have Pr?^^* 



0, whereas in the presence of a full-length 



1 



which might diverge from zero whenever ^ — — diverges 

from 1. The reasoning behind setting /3„ = is twofold: 
first, our past experience with analyzing SHAPE-Seq data 
shows that the structure cassettes tend to display negligible 
reactivities, and second, the observed counts X^^i and F„+i 
were very large compared to any other count. For example, 
in mapping experiments we conducted, X„_|_i amounted to 
approximately 10% of the total (+)-channel reads and Yn+i 
amounted to 15% - 20% of the total (-)-channel reads |[Tol . 
||9l , whereas the rest of the reads were associated with a total 
of 100 — 200 nucleotides. Based on these observations and on 
the premise that SHAPE-CE statistics should follow similar 
patterns, we believe that this approximation is fairly accurate 
in general. Nonetheless, one must keep in mind that this may 
not always be the case, especially when studying long RNAs, 
where cumulative dropoff effects result in severe signal atten- 
uation and consequently relatively weak full-length signal. In 



addition, we point out another subtle difference between the 
scheme's implementations under the two platforms. Specifi- 
cally, we may not assume that the CE-based Uk and Vk are 
always positive, whereas we made a similar assumption when 
we derived NGS-based estimates. That assumption was based 
on the fact that probing experiments can be designed such that 
a strong full-length signal arises. While this also applies to 
CE-based protocols, the absence of a full-length signal might 
complicate analysis whenever Xn = or Yn — 0. However, 
this can be easily remedied by converting these zeros into 
very small constants, such that the resulting approximations 
are negligible. 

To summarize, building on recent contributions to the 
automation of CE-based analog signal processing [17|, our 
method facilitates a simple and completely automated data 
analysis pipeline for CE-based chemical mapping probes, and 
in particular, for SHAPE-CE. Another interesting point arising 
from our derivation is that our ML scheme is invariant under 
background-signal scaUng. 

B. Effects of full-length signal information on ML estimation 

Our analysis highlights the fact that the difference in ML 
estimation between the two platforms lies essentially in the 
presence or absence of a full-length signal (compare Eqs. [7] 
and [19]). In this subsection, we first use our derivation to 
qualitatively explore the potential impact of this difference. 
We then quantify its effects by deleting the full-length signal 
information from SHAPE-Seq data to mimic SHAPE-CE data, 
such that the estimates under both platforms can be compared. 

Before we start, we stress that this difference between 
platforms pertains only to RNAs that are no longer than 400- 
600 nucleotides |4l, a limitation imposed by RT's imperfect 
processivity, as well as to RNAs that do not contain major 
transcription barriers that result in severe dropoff. RNAs of 
these two types are probed by annealing multiple primers 
at various sites lfT3]| . |fT9]| . Il20ll . in which case a full-length 
signal is not obtained from most primer locations even when 
the fragments are sequenced. In such settings, NGS and CE 
platforms generate similar information, and analysis should 
follow the lines of the CE framework. In what follows, we 
simplify the exposition by using the same notation for Xk 
and Xk, and similarly for Yk and Yk- 

To better understand the implications of not recording the 
full-length information, we rewrite the initial estimates as 



1 



Yk 



1 



Yk 



, (25) 



and consider the following two examples. 

1) Assume for simplicity that the (+) and (— ) RNA pools 
are the same size, i.e., Y^^=i -^i = ^nd 
suppose we probe a highly reactive molecule, or alterna- 
tively, use high reagent concentration. Both scenarios di- 
vert from single-hit kinetics toward higher modification 
rates, resulting in large dropoff in the (+) channel. As 
an extreme case, assume that the proportion of complete 
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Fig. 3. Reactivity estimates in tlie presence and absence of complete-fragment infoimation for 5. aureus plasmid pT181 sense RNA (a) and B. suhtilis RNase 
P RNA (b). Sites at the 3' end that showed negligible reactivities were omitted from display. 



fragments in the (+) channel is neghgible, such that 
Y^^=i ~ '}2"=i while it is significant in the (— ) 
channel. In this case, the difference between 1 — /3f 
and 1 — amounts to the difference between 



and Y ■ Because F„+i occupies a major fraction 
of the (— ) channel pool, the denominator of the right- 
hand expression is larger than the left-hand one, and 
consequently P^^^ > ■ This example illustrates 
that the missing information might lead to different 
estimates, and that under some scenarios, it results 
in under-estimation of all reactivities. Furthermore, the 
effect becomes more pronounced as we progress toward 
site n, since F„+i occupies increasingly larger fractions 
of X^fc^fc ^i- Poi" '^his reason, the relative reactivities are 
distorted as well, even when all the /3fe's are under- 
estimated. 

2) In this example, we still assume that Y^^=i Xi = 
J27=i '^^^ require both Xn+i and F„+i 

to represent significant fractions of the overall pools. 
We further assume that at the last two sites we observe 
Yn = Yn-1 — 3, Xn-1 — 30, and X„ = 6. Then, for 
site fc = n - 1 we obtain = !-(!- - 

|) = |, but on the other hand, when X„+i,y„+i are 
large enough (e.g., on the order of thousands), we have 
Pn-i ~ 0- Here, unknown large end-signals lead to 
the misinterpretation of minor (+) channel signals (or 
merely system noise) as indicative of high reactivity. 
This stems from misrepresentation of the molecular pool 
composition by Xi, . . . ,X„, and as before, the effect 
tends to intensify as we get closer to site n. 



These examples describe very specific and perhaps extreme 
scenarios, and clearly, it is difficult to predict the effect of a 
given pair X„+i,y„_|_i on the estimates of the entire RNA 
sequence, as these also depend on the underlying fragment- 
length distributions. While distortion may certainly be small at 
many sites, such differences accumulate when all estimates are 
jointly aggregated into an estimate of the modification fraction 
fmod- To demonstrate this cumulative effect, we first assume 
that the initial estimate B consists entirely of nonnegative /S^'s 
under both platforms, and so no zeroing is applied. It is then 
easy to see that 
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where the inequality is due to an unknown /?„ under CE 
settings (see discussion following Eq. [24| . For simplicity, we 
again assume equally sized (+) and (— ) RNA pools, in which 
completely determines , whereas is 



x„ 



case 

affected by site n's data ratio, as well as by Xn+i and 
y„+i's portions of the entire RNA pools (rather than by their 
ratio). It thus appears that the CE-based estimate is more 
susceptible to the actual experimental conditions and to noise 
toward the signal's end. Yet, in practice, many of the point 
estimates are zeroed out and so the formulation above may 
not accurately capture the true estimates. 

We conclude with two examples computed from SHAPE- 
Seq data, where we compare the NGS- and CE-based ML 
frameworks for the Staphylococcus aureus pT181 RNA and 



for the Bacillus subtilis RNase P RNA specificity domain. 
The CE case was analyzed by deleting the (+) and (— ) end 
signals, thus reflecting our assumption that SHAPE-CE data 
closely resembles SHAPE-Seq data, as we have previously 
observed for these two RNAs [9|. The estimated reactivities 
shown in Fig. [3] support our observations, and a clear trend of 
divergence between the estimates under the two frameworks 
is apparent toward the 5' end, where in these two cases the 
CE data result in over-estimation. Interestingly, the divergence 
in the fraction of modified molecules was minor in the pT181 
case {f^od ~ fnwd ~ 0.86) and amounted to approxi- 

mately 20% for RNase P « 0.62 vs. f^^^ « 0.52). Our 

results thus point out to a potential shortcoming of likelihood- 
based signal recovery schemes, when used in conjunction 
with CE systems. It is important to stress, however, that the 
previous and less automated method developed by the Weeks 
and Giddings labs |T2 | does not suffer from this shortcoming. 
This is because it relies on visual assessment of the signal's 
decay and its subsequent correction, whereas likelihood-based 
methods such as ours and the one reported in [T5^| explicitly 
utilize the observed frequencies to correct the signal. At the 
same time, relying on user feedback poses challenges to the 
reproducibility and accuracy of analysis, as discussed in ifTOl . 

V. Conclusion 

In this work, we presented a model and a maximum- 
likelihood framework, which lead to simple closed-form reac- 
tivity estimates, and which are applicable to chemical probes 
that use either CE or NGS for transcript quantification. We 
used this general framework to directly compare the estimates 
obtained with the two detection platforms, and concluded that 
lack of full-length signal information in CE settings degrades 
the estimates quality, and hence SHAPE-Seq is a more in- 
formative, and potentially more accurate, technique. Yet, it 
remains to determine the effects of this missing information on 
structure prediction's accuracy in order to clearly characterize 
the benefits of using new protocols such as SHAPE-Seq. 

Appendix 



To justify our claim that Eq. 12 pertains to the feasible 
ML solution, we first assume that (3k = and then optimize 
^fc(0,7fe) = Sk\og{l-jk) + {Xk+Yk)logjk to obtain 7^ as its 
maximizing argument. Hence, this point represents the maxi- 
mum over all points on one edge of the constrained optimiza- 
tion domain V = {(/3fc, 7fe) : < /3fc < 1, < 7fe < 1}. Now, 
assume that our claim is not correct, i.e., the maximum over 
T) is not attained at a point where f3k — 0. Then, there exists 
a point (/3fc,7fc) such that lk{hjlk) > hi/Slnt) > ^fc(0,7fc) 
for any < 7/j < 1. Clearly, {Pk^lk) must lie in Vs interior 
since Ik approaches — oo near all other three edges of V. Next, 
we construct a rectangular compact set £ = {(/3fc,7fc) : < 
/3fc < a < 1, < 6 < 7fe < c < 1} C X> around {h,lk), 
where we choose a,b and c such that lk{f3k,%) exceeds Ik's 
values over £'s boundary. This is possible due to the function's 
decline to — oo as (3k approaches 1 and as 7^ approaches {0. 1} 
and because lk{(3k, 7fc) is greater than Zfe's values at the fourth 



edge (where (3k — 0). Since £ is compact, Ik must attain a 
global maximum over it, but it follows from the construction 
that the maximum must lie in its interior This, in turn, means 
that this maximum must also be a stationary point (since Ik is 
differentiable over £), which contradicts the fact that the only 
stationary point is {(3k, Jk)^ which lies outside of V. 
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